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ABSTRACT 

Since the arrival of genetic typing methods in the late 1960’s, researchers have puzzled at the 
clinical consequence of observed strain mixtures within clinical isolates of Plasmodium falciparum. 
We present a new statistical model that infers the number of strains present and the amount of 
admixture with the local population (panmixia) using whole-genome sequence data. The model 
provides a rigorous statistical approach to inferring these quantities as well as the proportions 
of the strains within each sample. Applied to 168 samples of whole-genome sequence data from 
northern Ghana, the model provides significantly improvement fit over models implementing simpler 
approaches to mixture for a large majority (129/168) of samples. We discuss the possible uses of 
this model as a window into within-host selection for clinical and epidemiological studies and outline 
possible means for experimental validation. 

INTRODUCTION 

The protozoan parasite Plasmodium falciparum (Pf) is the cause of the vast majority of fatal 
malaria cases, killing at least half a million people a year |16l H2l 148] . The parasite’s ability to 
develop resistance to drugs and the rapid spread of that resistance across geographically-separated 
populations presents a constant threat to international control efforts iZlEHllMl. While research 
has elucidated many genetic factors in resistance, much of genetic epidemiology of the parasite - 
including the effective recombination rate and the rate of gene flow across populations - is still 
unclear IMIIMIET]. 

Since the late 1960’s, researchers focused on the structure of Pf clinical infections have struggled 
to understand the implications of multiplicity of infection (MOI), where multiple strains appear 
to be present within a single patient’s bloodstream |46| 125] IT9l [T] |3]. While MOI-focused studies 
implicate increased or decreased levels of MOI with a range of conditions, including clinical severity 
[29] . age-specific severity dZl uni El E], parasitemia levels during pregnancy [5], and other effects 
miiiiiMiEi], there is no broad consensus about MOFs role, if any, in controling the course of an 
infection. Still, a wide variety of studies and genetic assays ~ most typically by typing of the mdr 
gene - show MOI as a reliable feature of clinical Pf isolates |3]. 
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The appearance of whole-genome sequencing (WGS) technologies applied to Pf extracted di¬ 
rectly from infected patients’ bloodstreams provides an unprecedented window into the structure 
of genetic mixture within samples j23[ [2]. Initial work on understanding the structure of this mix¬ 
ture data shifted focus from estimating MOI to analysis based on inbreeding coefficients IHlEIllls]. 
These metrics, a special type of T-statistic, provide an estimate of the departure of within-sample 
allele frequencies from those expected under a Hardy-Weinberg-type equilibrium with the nearby 
population. In this perspective, each patient’s bloodstream is taken to be a subpopulation exhibit¬ 
ing a degree of admixture of all of the strains from the local environment, ranging from a perfectly 
random sampling of all nearby strains to the repeated sampling of just single strain. 

The initial study applying WGS to clinical Pf isolates collected from eight countries on three con¬ 
tinents shows that the parasite exhibits significant population structure at continental scales, with 
the amount of subpopulation structure varying significantly among regions [23]. Employing novel 
F-statistics to measure the inbreeding coefficient, this work also argued that the degree of mixture 
varies significantly across populations, with highly mixed samples occurring relatively frequently in 
west Africa but only occasionally in Papua New Guinea. Importantly, the authors suggested an 
association between increased levels of observed mixture with increases in transmission intensity in 
the local environment. Transmission intensity, the rate at which individuals are infected with Pf, 
likely determines some part the frequency of out-crossing within parasite populations and so may 
be critical to understanding gene flow and strategies for resistance control |15) . 

In this paper, we present a new statistically rigorous model that synthesizes these two distinct 
and previously disparate approaches to analyzing P. falciparum clinical mixtures: assessing the 
number of distinct genetic types within a sample (the MOI approach) and measuring the degree 
of panmixia with respect to the local population (the inbreeding coefficient approach). The model 
centers around how these two sub-models contribute to generate the observed within-sample non¬ 
reference allele frequency as it relates to the population-level non-reference allele frequency for single 
nucleotide polymorphisms (SNPs). For clarity, we will deprecate the use of non-reference in front 
of the term allele frequency, since they are all calibrated in this fashion. We will use the acronyms 
WSAF to denote the within-sample allele frequency and PLAF to denote population-level allele 
frequency to avoid confusion about the particularly frequency being indicated. 

The essential structure of the model is to explain observed ‘bands’ that emerge when examining 
the WSAF for SNPs as a function of their PLAF (Figure]^. The model posits that the number 
of these bands results as a direct consequence of the number of distinct strains present within a 
sample and that the degree of admixture with the local population determines bands’ slopes. To 
distinguish from the inbreeding coefficient, we refer to the degree of admixture as the panmixia 
coefficient. The collection of bands are then modeled jointly as a finite mixture. 

Figure lays out the important components of the model. In the simplest case a sample 
is composed of a single, unmixed strain, and all SNPs exhibit a WSAF of zero or one (see Figure 
j^a)), depending on whether they agree with the reference. Gonsequently the WSAF is independent 
of PLAF, leading to two flat bands at these values. We call these samples unmixed. In the case 
where finite number of strains mixed within a sample, then for each variant position some number 
of those strains will possess a reference allele and some will not. Which strains carry non-reference 
alleles and those strains’ proportion in the sample mixture then determine the WSAF for each SNP. 
Observed across many SNPs, this leads to the apparent bands of constant WSAF across the PLAF. 
It follows that for K component strains there are 2^ possible combinations of biallelic states, leading 
to that number of apparent WSAF bands. 
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The complementary banding structure of panmixia arises when a fraction of the Pf organisms 
present within the blood are randomly sampled from the local population. In its simplest formula¬ 
tion, the panmixture model represents the admixture of two distinct Pf populations: a single strain, 
representing 1 — a of the within-sample genomes, and a random sample of strains from the local 
population, representing a of the remaining genomes. In the case of perfect panmixia (a = 1), a 
sample would be comprised of organisms evenly sampled from the ambient population and the plot 
of WSAF against PLAF would become a single line at y = x. In this data set, we do not observe 
any sample close perfect panmixia but observe several instances of apparent partial panmixia with 
a single dominant strain (Figure [^c) and|^c)). The a tilt in the WSAF arises from the fact that 
for this proportion of organisms the probability of sampling non-reference allelele is proportional to 
the PLAF, absent any other population structure. These samples, with a single strain and a degree 
of panmixia, we call panmixed. In more complex cases, where there is more than one dominant 
strain, the total number of bands is still determined by the number of these strains. However, now 
panmixia tilts each of these bands equally leading to complex mixtures (Figures [^d) andj^d)). The 
paper proceeds as follows. We first detail the structure of the WGS data, introduce some notation, 
and the essential mathematical structure of the model. We then present an extensive simulation 
study on the performance of the model and then an examination of its application to field isolates 
collected from northern Ghana. We conclude by discussing the strengths and weakness of the model, 
some possible improvements, and what consequences this analysis may have for understanding the 
etiology of clinical malaria. 


DATA, NOTATION, AND MODEL 

Data 

The WGS data come from Illumina HiSeq sequencing applied to P. falciparum extracted from 235 
clinical blood samples collected from infected patients from the Kassena-Nankana district (KND) re¬ 
gion of Upper East region of northern Ghana. Gollection occurred over approximately 2 years, from 
June 2009 to June 2011. The full sequencing protocol and collection regime are described in |23j . Af¬ 
ter quality control measures, sequencing was performed on 235 samples, and, following a documented 
protocol using comparison against world-wide variation, 198,181 single-nucleotide polymorphisms 
(SNPs) were called within each sample |2^. Each call provides the number of reference and non¬ 
reference read counts observed at each variant position within the genome, ascertained against the 
the 3D7 reference |in) . For this project, we additionally filtered these data. First, multiallelic posi¬ 
tions were reclassed as biallelic. We then excluded positions that exhibited no variation within the 
KND samples, any level of missingness (no read counts observed), or minor allele frequency less than 
0.01. To remove low quality samples, we removed thoses possesed more than 4, 000 SNPs called with 
fewer than 20 read counts, following an inflection point observed in Supplementary Figure Sl(a). 
These cleaning measures left 2,429 SNPs in 168 samples. More than 95% of remaining samples’ 
sequencing was completed without PGR amplification. We observe little apparent population struc¬ 
ture among the samples, evidenced either by principal components analysis or a neighbor-joining 
tree of the pairwise difference among samples (Supplementary Figures S2). The data preparation 
scripts are available with the source code for the model, https://github.com/jacobianl980/pfmix/. 
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Notation 

Following the data preparation and cleaning, our analysis begins with a set of N clinical samples, 
each composed of M SNPs. At each SNP j within each clinical sample i, we observe rjj reads that 
agree with the reference genome and riij reads that do not agree with the reference. For a sample 
j. we write the complete data across all SNPs as Vi = [{rn^nn)^ • • • , {riM-,niM)]- For each SNP j, 
we associate a PLAF pj. The collection of all pj we refer to as V. 

Conditional upon the number of strains AT, there are 2^ bands, indexed by r = 1, • • • , 2^. The 
full collection of bands we call Q, with qijj. indicating the WSAF for band r at SNP j in sample i. 
The probability of a SNP lying within the distinct bands across the PLAF is specified by a mixture 
component Xr, which is a function of the PLAF, and so is often written \r{pj). The degree of 
panmixia in a sample is given by a, a value between zero and one. A complete list of the model 
parameters is given in Table 

Model 

Statistically, the model takes the form of a finite mixture model, with the mixture components 
associated with individual bands |36l I26| . We take a Bayesian approach to inference and so lay out 
the model by giving an overall rationale for the decomposition of the posterior distribution and then 
justifying the appropriate choice of probability distributions for each of the terms m 

Decomposition 

We assume that samples are independent of each other and that the SNP data for each sample 
depends solely on K, the WSAF Q, the PLAF V, and a shape parameter v. As samples are 
independent, we will deprecate sample-specific subscripts for the model parameters. Considering 


the data for a single sample, Vi, the posterior distribution can then be written as: 

F{Q,V,W,a,iy,K\Vi) oc F{Vi\Q,V,W,a,i^,K) ■FiQ,V,W,a,i^,K) (1) 

= F{Vi\Q,V,i^,K)-F{Q,V,u,K,W,a). (2) 

We also assume that the WSAF, Q, depends only on the PLAF, V, the panmixia coefficient a, 
the number of strains K, and their proportions within the sample, W, allowing the right-hand side 
of Equation]^ to be further decomposed, by noting that 

F{Q,V,v,K,W,a) = F{Q\V,u,K,W,a) ■F{V,v,K,W,a). (3) 

While W clearly depends on the number of strains, K, the remaining parameters we take to be 
independent of this value and of each other. This means that the last right-hand side term in 
Equation becomes: 

F{V,v,K,W,a) = F{V)-F{iy)-F{W\K)-F{K)-F{a). (4) 

Substituting Equations and into Equation yields the final decomposition: 

F{Q,V,W,a,u,K\Vi) oc F{Vi\Q,V,u,K)-F{Q\V,u,K,W,a)- 

F{V)-Fiiy)-FiVVlK)-FiK)-Fia). (5) 


We now specify each of the terms on the right-hand side above as probability distributions. 
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Likelihood : ¥{Vi\Q,VK) 


Within band r, the WSAF at SNP j in sample i is qijr- Supposing that read counts at j are 
identically and independently distributed with probability qijr^ we model the probability of the 
data {vij^riij) as a Beta-binomial distribution, allowing us to model greater dispersion than expected 
under a pure binomial. We parameterize this distribution in terms of qijr and u rather than the 
more commonly used shape and scale parameters, a and j3. The relationship between the two 
parameterization is q^jr ■ v = a and (1 — qijr) ■ v = j3. We use this parameterization as it allows us 
to write the model in terms of the mean allele frequency that defines each band. The additional v 
is a shape parameter that serves as a proxy for the variance. These parameters give a likelihood 
expression: 


'^{nij,rij\r,qijr, v) 


f riij + rij\ 

[ n, ) 


B(?rjj -|- qijr ■ I'tT'ij -\- (1 qijr) ■ ^) 
(f Qijr) ' ^) 


( 6 ) 


where B is the beta function. 

As any SNP could lie within any band, we employ a novel version of the finite mixture model 
to capture this segregation. Fixing the number of strains to K, there are then 2^ ways that the 
strains can be assorted into non-reference and reference allele states at any given position j. A given 
band r arises from Cr strains exhibiting the non-reference allele and 2^ — Cr strains exhibiting the 
reference allele. Supposing no population structure among the strains, the probability that a given 
SNP will be in that band is simply the probability of drawing Cr non-reference alleles and 2^ — Cr 
reference alleles, conditional upon pj: 


P(SNP j G band r\pj) = p^'" ■ (1 — 

= K{Pj)- 


Consequently, the density of the mixture coefficients for each band varies across the PLAF but such 
that they sum to 1 across all bands at any position j: 

T(T>ij|Q,T’,zz, AT) = ^F{SNP j G hand r\pj) ■F{nij,rij\r, qijr, 1^) 

r=l 

2^ 

= '^K{pj) ■F{nij,rij\r,qijr,i^). 

r=l 

Following from the assumption of no population structure, SNPs will assort into bands indepen¬ 
dently. This leads to a product form for the likelihood of sample’s data. Dp. 


F{Vi\Q,V,u,K) 


M r 2^ 

L 

j=l L r=l 


n E {Pj) ■ Fi^Tiij,rij\r, qijr, v) 


( 7 ) 


Band structure: F{Q\V,v,K,yV,Q.) 

The full mixture model contains two distinct subcomponents that we call the simple mixture model 
and the panmixture model, respectively. Both models generalize the unmixed case, though natu¬ 
rally in different ways. We first describe the unmixed model and then layout the two extensions 
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before showing how these can be combined to create the full model. In practice, we only fit data 
using the full model and allow it to indicate the number of strains, their proportions, and the degree 
of panmixia. We do not know the number of strains a priori so we employ metrics applied to the 
posterior distribution inferred with different values of K to determine it. However, for the purpose 
of detailing the model, we assume that K is known. 

Unmixed model - In an unmixed sample only one strain is present and there is no panmixia, and 
so K = 1 and 0 = 0. Consequently, we expect all SNPs to exhibit WSAF either zero or one (Figure 
j^a)) depending on whether they agree with the reference or not. There are then two bands, r = 1, 2 
and Qiji = 0 and qij 2 = 1. 


Simple mixture model - The simple mixture model assumes that a finite number K of distinct 
strains, si, • • • , are combined together in the sample with proportions, W = [tci, • • • , wk] but 
that a = 0. Naturally, = 1 and for each SNP j, the probability of being within band r is 

given by Xr{pj), as above. Band r is defined by a vector Vr = [l{sier})''' > where l{sj.gr} 

is an indicator function of whether strain k exhibits a non-reference allele within the sample. The 
WSAF of Qijr is then given by the sum of all of proportions of strains that exhibit a non-reference 
allele: 

K 

9ijr = Wk • l|5j.gr}- (8) 

k=l 

Taken across all r bands, this leads to 2^ bands with zero slope and corresponding proportions 

{0,Wi, ■ ■ ■ ,WK,Wi + W2,Wi +W3,--- , 1 ). 

Panmixture model - As mentioned above, in its simplest case, the panmixture model represents 
the admixture of two distinct Pf populations: a single strain, representing 1 — a of the within- 
sample orgnaisms, and a random sample of strains from the local population, for the remaining a 
organisms. When a = 0 the model reduces to the unmixed case. We will refer the single strain 
as the dominant strain, although, conditional upon a, it may represent only a small proportion 
of the sample’s population. For each position j, there are still only two bands: the higher one 
corresponding to the non-reference allele being present in the dominant strain, and the lower one 
corresponding to its absence. However, the WSAF for these bands varies according to pj with slope 
a. To resolve qijr-, first consider the upper band, r = 2. At any position j, 1 — a of the reads come 
from the dominant strain. The remaining reads, each sampled randomly from the local population, 
each have probability pj of being non-reference. This leads to qij 2 = (1 — a) + a ■ Pj- For the lower 
band, the dominant strain contributes no non-reference reads so qiji = a - pj. 

Complex mixture model - The complex model synthesizes the simple mixture and panmixture 
models. In this case, at position j, a of the reads are sampled randomly from the across the local 
population, contributing a fraction of a ■ pj non-reference alleles. The state of the remaining reads 
are determined by W as in EquationFor band r at position j, the WSAF is then given by 

qijr = (l-a)- ('^Wk-l{s^,(^r}] +a-Pj- ( 9 ) 

^ k=l ' 

There are then 2^ bands with proportions (0, tci, • • • , wk,wi + W 2 ,wi + W 3 , ■ ■ ■ ,1) and slope a. 
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Priors 


For the remaining four probability distributions we place the following vague prior distributions: 



W|A 

~ DIRICHLET(li^) 


a 

~ UNIFORM(0,1) 


V 

~ EXPONENTIAL (5) 


K 

~ zero-truncated POISSON(2) 

where lii: is 

a vector of K ones. 


Inference 




We infer the model parameters using a standard Bayesian Markov chain Monte Carlo (MCMC) 
approach |13[ [12] with one exception: we first calculate maximum-likelihood estimates (MLE) for V 
across all samples and then treat these as fixed when inferring the remaining parameters [38] . This 
choice is motivated by statistical expedience and computational speed. Except for "P, the parameters 
of the model are independent across samples and so this approximation enables the algorithm to 
infer parameters in parallel rather than jointly. This avoids the difficulties of performing inference 
on the number of strains within samples simultaneously, which would require an involved trans- 
dimensional MCMC scheme (such as reversible jump MCMC, |14] 1 acting jointly across all samples. 
Running in parallel also increases the computational speed of the implementation by at least an 
order of magnitude. Since the sample collection is large enough that V is nearly independent of any 
given sample, we do not expect this approximation to significantly bias inference. 

For each SNP j, the MLE derives from treating the non- and reference reads within a sample 
as coming from a binomial distribution with parameter pj. This leads to: 

N , N 

Pj = ^{riij+rij). 

i i 

To infer K for each sample, we employ a Bayesian Information Criterion (BIC) |35[ |6] and harmonic 
mean estimator to the Bayes Factor (hmeBF) |221 [50] as metrics for model selection. To find the 
maximum likelihood value for use with the BIC, we initially implemented a separate estimation 
algorithm but found no significant difference with using the highest value observed from the posterior 
samples. In simulations, we observe that the BIC and hmeBF provide similar guidance, with the 
BIC frequently indicating a smaller K. For the simulation study and empirical data example, we 
provide only the BIC result. 

Conditional on V and K, we implement a Metropolis-Bastings algorithm to draw samples from 
the posterior distribution m For each of the three parameters, a, W, and ly, we propose new values 
directly from the prior distribution, leading to Metropolis-Bastings ratios almost solely dependent 
on the ratio between the likelihood and priors for the proposed state to those for the current. 
The inference scheme is implemented in set of scripts for the R computing language, and can be 
found under the Academic Free License at https://github.eom/jacobianl980/pfmix/s. For a single 
sample, a sufficiently long MCMC run takes approximately 20 minutes on a single high-performance 
computing core. 
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RESULTS 


Simulations under the model 

To demonstrate the efficacy of our implementation, we present a simulation study examining the 
algorithm’s performance. We consider two distinct aspects of the inference separately: how well 
the model infers the number of strains, and, conditional upon that number, how well it infers the 
model’s other parameters. We simulate data from the model in the following way. Conditional upon 
M,a, K and the sum of the read counts, C, we draw a vector of probabilities, W, from a uniform 
Dirichlet distribution. We combine the values of W in all possible permutations to create the 2^ 
bands and assign the PLAF for the SNPs evenly from 1/M to 1, so that the SNP has PLAF 
For each SNP, we first probabilistically select the band it occupies according to according to 
Equation!^ Conditional upon selecting r, we then simulate read counts according to the likelihood 
(Equation with qijy according to Equation For all simulations, we set v = 10. We run the 
simulation across the range of values for M,a, K and C. For each parameter set, we create 10 
independent realizations. 

Number of components 

Figure shows performance of the algorithm for inferring the number of components increases in 
precision with the number of SNPs and the number of reads. Conditional upon a, the simulations 
indicate that the number of SNPs, M, to be the largest determinant of performance, with the sum 
of the read counts, C, playing an important supporting role. Inference of the number of underlying 
strains, K, is generally strong for low panmixture levels (small a values), but is noticeably more 
conservative for a = 0.5, likely due to the bands becoming increasingly tightly packed as panmixia 
increases. In general, inference is slightly conservative, likely owing to the BIC estimator’s bias 
toward parsimony |9]. 

Parameters 

Figure shows similar performance for inference of the strain proportions W and a. For W, we 
report the mean squared deviation. For a, we report the absolute normalized deviation to account 
for relative difference from the true value. For both parameters, we observe that the number of SNPs 
is the strongest determinant of accuracy, with M = 150 ensuring moderately strong performance. 
High a moderately decreases the quality of inference for the strain proportions. 

Clinical samples from northern Ghana 

Applying the algorithm to the 168 high-quality samples from KND, we observe K range 1 to 7, with 
a falling between 0 and 0.14, and a moderate correlation between K and a (Figure]^. The largest 
subset of samples were unmixed, with K = 1 and a < 0.01, though the majority of samples exhibit 
moderate levels of mixture, with K = 2, 3,4 and 0.01 < a < 0.03. A small number of samples 
exhibit complex mixtures, with K > A and a typically greater than 0.02. These results confirm the 
presence of mixtures within Pf clinical isolates, but also indicate, possibly more complex patterns 
involving interactions between the number of dominant strains and the degree of panmixia. 

We observe that for most samples the 95% credible interval for a is within a small percentage of 
the median value. For Wj, we observe a similarly tight posterior distribution, particularly for samples 
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with K < 3. The posterior uncertainty increases together with increasing K and increasing a. For 
a small number of samples, the model initially produced unusually low values for v, indicating a 
bimodal Beta-binomial distribution inconsistent with a mixture of strains and consequently suspect 
values for K. For these, we bounded v such that it ensured a unimodal distribution and then 
recovered results consistent the remaining samples. 

To visually inspect the quality of the results, we generate hgures for each of the samples showing 
the observed WSAF and PLAF data, the inferred model structure, and data simulated under the 
inferred model following the observed PLAF. We show examples of these plots for three typical 
samples in Figures Nearly all samples (158/168), across all different mixture patterns, show 
strong visual correspondence between the observed and model-simulated data. We also observe 
a strong correlation between the inferred number of components and a quasi-maximum likelihood 
estimate for the inbreeding coefficient for each sample (Figure 0 m 

For each sample, we compare the full model to two reduced versions under the restrictions 
a = 0 and K = 1, respectively. These restrictions correspond to the cases where the model 
becomes the simple mixture model, with 2 ^ bands but no admixture with the local population, 
and the panmixture model, where there is a single strain with some local population admixture, 
respectively. For numerical stability reasons, we set the former restriction as a. = 0.001. For 60% 
of samples (108/168), the BIC selects the full model over either of the restricted models. For 58 
samples the BIC criterion selected the K = 1 restriction, while for 23 samples it selected a = 0. 
Taken in aggregate across all samples, the criterion overwhelmingly selects the full model over either 
of the restricted models. 


DISCUSSION 

The model captures two dimensions of within-sample mixture for P. falciparum that had previously 
modeled separately: the number of strains and the degree of admixture with the local population. 
Evidenced by the comparison of the full model with restricted sub-models, this approaches provides 
a marked improvement over both more restricted approaches in capturing the structure of mixture in 
clinical samples. While the model provides a more involved qualitative understanding of the samples, 
the strong correlation between the inbreeding coefficient and the inferred number of strains shows 
that the model produces results consistent with previous methods. 

In order to perform inference, the model makes a number of simplifying assumptions that may be 
violated in practice. The model presumes that SNPs are unlinked and consequently independent for 
the purpose of calculating the likelihood. Given the high recombination rate of P. falciparum this 
assumption may hold for the majority of pairs of SNPs, but neglects correlations that appear locally 
(~ 10 kB). However, we expect that this independence assumption serves to moderately weaken the 
inferential power of the model rather than cause any type of bias since it fails to include possibly 
informative data, rather than posit a possibly misspecihed model. More problematic is the model’s 
implicit assumption of limited population structure. In the case of the KND samples, and perhaps 
in much of west Africa, this assumption appears supported mm- In other contexts, specihcally 
south-east Asia, recent population bottlenecks and selection suggest that this assumption will be 
violated m The consequences on this model inference are unknown but can likely be partially 
resolved with appropriate simulation studies. 

The model presents an important new tool for interrogating the biology of clinical Pf infections. 
In particular, how the number of component strains and the panmixia coefficient relate to the 
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infection parameters, snch as seasonality, transmission intensity, and ontcrossing, and evolntionary 
parameters snch as the rate of change within sections of the Pf genome, conld have implications 
for nnderstanding the genetic epidemiology of Pf. The model also presents a means for clarifying 
the poorly detailed strnctnre of intra-host infection dyanmics, snch as strain selection or density- 
dependent selection |21| . by resolving how the nnmber of strains, the mixtnre proportions, and the 
panmixia coefficient change within the conrse of an infection or in response to drng intervention. 

An nnexpected strnctnral conseqnence of the model is that power to infer additional strains 
diminishes as the panmixia coefficient (a) increases. This results from the simplifying assumption 
that 1 — a of the reads come from the dominant strains while the remaining a fraction are sampled 
randomly from the local population. Geometrically, we see that as a increases that the bands will 
get progressively closer together as they approach panmixia, making them harder for the model to 
distinguish. Also, as a increases to one, the fraction of reads representing the dominant strains 
diminishes, reducing power to infer these components. We observe this pattern strongly in simula¬ 
tions (Figure]^: for a = 0.5 or greater, the model consistently infers too few components. While 
this deficiency may be overcome in some fashion with additional SNPs or read counts, the geometry 
indicates there may be fundamental limits on any model’s ability to discriminate the true number 
of components in the high panmixia regime. 

In principle, the model can be explicitly tested against experiment. Laboratory facilities with 
the capacity to store many field strains (> 100) could generate artificial samples in an experimental 
analog of our simulation procedure, as follows. Starting with N unmixed strains, identified using 
inbreeding coefficients, they could create mixtures, they would need to first fix the required se¬ 
quencing volume as rj, and the parameters for panmixia (a), number of component strains (iF), and 
their mixture parameters, S and W. For the finite mixture component, they would then combine 
volumes of t/-W from the K strains. For the panmixture component, they would then fix some large 
number but experimentally feasible number of strains (say 100) and randomly sample from all of 
them a volume of r//100. Finally, combining these into final sample and applying WGS sequencing, 
will yield data that we hypothesize will closely follow the integrated model outlined above, with v 
capturing the experimental variation. Naturally, consistent results would indicate the sufficiency of 
the model, but not it’s necessity, holding out the possibility of a more minimal description. These 
results could be further compared against other next-generation technologies, such as single-cell 
sequencing, that have been deployed to understand Pf clinical mixtures |3n) . 

The method works efficiently in practice (the properties of a single sample can be inferred in 
minutes on a standard laptop) but a number of possible improvements could strengthen its statisti¬ 
cal performance. Most immediately, creating a full Bayesian approach rather than the parallelizing 
implementation here - while likely not improving the parametric inference for individual samples - 
would provide the full posterior distribution across all samples for more considered model compar¬ 
ison. In that same line, more refined approaches to inferring the number of strains within samples, 
either via a reversible jump MGMG approach or methods for rigorously estimating Bayes factors 
|14) . would provide researchers more power to resolve the structure of mixture, though likely at the 
cost of significantly more computation. 

The model does not perform haplotype phasing to resolve the sequence of the underlying strains 
ga da [32]. The analysis here suggests that a method for estimating haplotypes would be straight¬ 
forward for some samples (unmixed ones, for instance) but difficult if not impossible for others 
(when a is greater than 0.5). Researchers may be particularly interested in whether, in these 
phased samples, particular stretches of the genome appear more or less frequently in the dominant 
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strains than others, indicating immnnological or environmental selection. This is also an avenne for 
statistical development. 

The two phenomenologies of mixtnre that the model captnres - a finite mixture of distinct strains 
and an inbred population admixture - cannot be immediately associated with any specific aspect 
of the infection process. A number of variables appear plausible in determining these relationships, 
including transmission intensity, the length of infection, the immunological status of the infected in¬ 
dividual, and within-host density dependent selection. Together with WGS data, this new approach 
can serve as a means for biological researchers to directly resolve these hypotheses and resolve the 
consequence of mixture in P. falciparum infections. 
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Tables 


Parameter 

Definition 

N 

Number of samples 

M 

Number of SNPs 

K 

Number of strains 

i = l,--- ,N 

Index for samples 

j = !,■■■ ,M 

Index for SNPs 

r = l,---2^ 

Index for bands / strain mixtures 

Pj 

(Non-reference) allele frequency for SNP j 

V = [Pj] 

The PLAF for all SNPs 

Q = [qij] 

Within-sample allele frequency for SNP j in sample i 

a 

Degree of panmixia within a sample, panmixia coefficient 

•S = [si, • • • , sk] 

Strains in a sample 

II 

Strain proportions in a sample 


Band proportions within sample 

u 

Variation parameter for Beta-binomial 

WSAF 

Within-sample allele frequency 

PLAF 

Population-level allele frequency 


Table 1: Parameters and definitions for the model and its description. 
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Parameter 

Values: 

M 

50 

150 

500 

2500 

C 

10 

25 

100 

250 

a 

0.01 

0.1 

0.5 


K 

1 

3 




Table 2: Table of simulated parameter values: C the number of read counts while M, K and a. are 
as in Table 1^ 

Figures 





rTTTT— 

v* c... V*..■ 







(d) 


0.2 0.4 0.6 0.8 1 

Population-level allele frequency 


Figure 1: Four representative samples with WSAF for each SNP plotted against the PLAF, showing 
an absence of mixture (a), a partially panmixed sample (b), a simple mixture (c), and a complex 
mixture (d). 
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No Mixture 


Panmixture 




Compiex Mixture 



Population allele frequency 


Figure 2: The essential structure of the model comprises four distinct states, relating the WSAF 
to the PLAF: no mixture (upper left); simple mixture (lower left); panmixture (upper right); and 
complex mixture (lower right). 



Figure 3: Performance for inference of number of components 
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Figure 4: Performance for parameter inference: (a) mean squared deviation for W by number of read 
counts; (b) mean squared deviation for W by number of SNPs; (c) absolute normalized deviation 
for a by number of read counts; and (d) absolute normalized deviation for a by number of SNPs. 
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Number of inferred strains per sample Number of inferred strains per sample 


Figure 5: The frequency of number of inferred strains per sample (lef) and the posterior median 
value of a by the number of inferred strains (right). 


15 











Observed data 


Inferred model 


Model-simulated data 



Population-level allele frequency 


Figure 6: Examples of real data. For three samples (rows), we present the observed data WSAF 
plotted against the PLAF (first column), a diagram of the inferred model indicating the bands, 
proportions, and a (second column), and data simulated under the inferred model, a and W 
are the maximum a posteriori values. In the second column, the model’s PLAF-varying mixture 
densities are shown in grey scale, with black equal to one. 



1 2 3 4 5 6 7 

Number of inferred strains per sample 


Figure 7: Boxplot of the F-statistic inbreeding coefficient (1 — F^g) for each sample grouped by the 
number of inferred strains. 
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